Practical Epidemiology Portfolio
A) If you smoked in the past, how many cigarettes did you smoke? This is a continuous (scale) variable in SPSS. A scale variable is measured at the interval or ratio level, where the numeric values carry meaning in terms of their numeric relationship to one another, beyond simple order or category membership. Data obtained from such observations are used to measure central tendency and dispersion in a descriptive analysis of the studied population (Pallant, 2010).
B) The research question ‘How many cigarettes do you smoke?’ can be changed to ‘Do you smoke more than 4 cigarettes a day?’ for binary transformation. According to Bjartveit (2005), light smokers who smoke 1-4 cigarettes a day are at greater risk of dying from smoking-related diseases than non-smokers.
The binary outcome identifies, through a Yes or No response, which people in our cohort of smokers smoke 1-4 cigarettes a day and which smoke more. This variable can then be used in a logistic regression, because the outcome has been divided into two categories. Predictors have to be coded according to their categorical scales (Menard, 2002). The Yes or No responses have to be converted into 0 or 1 so that the quantitative analysis can be done in SPSS.
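The recoding described above can be sketched outside SPSS as follows. This is a minimal illustration only: the cigarette counts and the function name are hypothetical, and the coding of Yes as 1 and No as 0 is an assumption about how the variable would be set up.

```python
# Hypothetical daily cigarette counts for five smokers (illustrative values only).
cigarettes_per_day = [2, 4, 5, 10, 1]


def smokes_more_than_four(count):
    """Return 1 (Yes) if more than 4 cigarettes a day are smoked, otherwise 0 (No)."""
    return 1 if count > 4 else 0


# Apply the recode to every participant.
binary_smoking = [smokes_more_than_four(c) for c in cigarettes_per_day]
print(binary_smoking)  # [0, 0, 1, 1, 0]
```

A participant who smokes exactly 4 cigarettes a day is coded 0, which keeps the 1-4 group together as the light smokers.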
SPSS Output
‘What is your age in years?’ is transformed from a continuous variable into an ordinal variable. Ordinal variables are categorical variables arranged in order of an intrinsic hierarchy. The difference between values may be relevant when calculating frequencies, for example on a Likert scale, to find the most frequent value (mode) and the middle value (median) as measures of central tendency (Field, 2013).
Transforming the continuous variable (age) into an ordinal variable requires putting ages into categories, to determine whether a certain age group is more likely to be affected by a disease than another, although the extent of the disease within each group may not be known.
A research question could be: Is there a correlation between Alzheimer's disease and people aged 40 to 65? (Nhs.uk, 2015). An ordinal variable allows the researcher to determine whether Alzheimer's disease affects one age group more than another. Once the groups are assigned values in proper sequence, the outcomes can be analysed using the median or mode to determine central tendency (REF). In SPSS each age group will be assigned a value and the values will be ranked in a hierarchy.
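The grouping of age into ranked categories might look like the sketch below. Only the 40 to 65 band comes from the research question; the other cut-offs, the rank values and the function name are assumptions made for illustration.

```python
# Hypothetical age bands as (lowest age, highest age, rank, label).
AGE_BANDS = [
    (0, 39, 1, "under 40"),
    (40, 65, 2, "40 to 65"),
    (66, 150, 3, "over 65"),
]


def age_to_band(age):
    """Return the (rank, label) of the band that contains an age in whole years."""
    for low, high, rank, label in AGE_BANDS:
        if low <= age <= high:
            return rank, label
    raise ValueError(f"age {age} is outside the defined bands")


ages = [23, 40, 58, 65, 71]
print([age_to_band(a) for a in ages])
```

Because the ranks are ordered, the median or mode of the ranks can be reported, but the distance between ranks has no numeric meaning.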
SPSS Output
Research question
‘How easy is it to manage your household income?’ is an ordinal survey question that will be transformed into a binary variable in SPSS, labelled ‘Do you manage your household income easily?’ This question requires the survey participant to answer Yes or No.
Research question: Does easy management of household income reduce poverty and crime rates among men aged 30 to 60 years in the United Kingdom? Bali Swain and Floro (2014) state in their research that “Uncertainty faced by low- income household increases their vulnerability making poverty even more unbearable”.
SPSS Output
A nominal variable consists of several category labels applied to the population being observed, without any quantitative value. A nominal variable can have many categories, which can be reduced to fewer categories to simplify the observational data (Wagner, 2007).
Research question: Is there an association between emotional/behavioural problems and studying? (Manthorpe, 2001). For this research question the employment category in the dataset will be transformed into a dichotomous category, where one group will be students and the other group will be all participants who are not students. The student category will be the reference category, compared with the other category when measuring association bet
A) Table showing the number and percentage of missing values for each variable in the dataset
B) Table showing mean and standard deviation before mean substitution
C) The table above shows that the standard deviation for the variable labelled ‘what is your age in years’ is 38.226 before mean substitution; after mean substitution the value decreases by 0.8508 to 37.3752.
For the salary variable there was a decrease of 43,168.1189 when mean substitution was applied. The difference in the standard deviation of these variables with and without mean substitution can produce excessively long numbers that need to be rounded off, which is likely to introduce error into calculations (Leech et al., 2011).
Missing observations in a dataset can be replaced with the average so that a complete-case analysis can be carried out. The mean is used to replace missing data so that a reliable result can be produced, and the mean of the variable remains unchanged. However, this method artificially reduces the variability of the scores in the data and distorts the correlation by underestimating it (Little and Rubin, 2002).
Listwise deletion removes observations that are incomplete in the dataset and only allows analysis of cases with complete information. This method is simple, and it is unbiased if observations are missing completely at random (MCAR), although estimates can still be biased if data are only missing at random (MAR) (Humphries, 2014).
Pairwise deletion makes use of all recorded cases in which the variables of interest are available and only excludes observations with a missing value on those variables. This helps to keep as many cases as possible within the dataset, but it makes analyses difficult to compare because the sample is different for each one (Humphries, 2014).
Regression imputation replaces each missing value with a predicted value from a regression equation. This overestimates correlations, but the information used to predict the values comes from the observed data (Enders, 2010).
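The effect of mean substitution on the spread of a variable can be illustrated with a small sketch. The ages below are hypothetical and are not taken from the dataset; the point is only that the mean stays the same while the standard deviation shrinks.

```python
from statistics import mean, stdev

# Hypothetical ages with two missing values recorded as None.
ages = [20, 30, 40, 50, 60, None, None]

# Statistics on the observed values only.
observed = [a for a in ages if a is not None]
observed_mean = mean(observed)
sd_before = stdev(observed)  # sample standard deviation of the 5 observed values

# Mean substitution: every missing value is replaced by the observed mean.
filled = [observed_mean if a is None else a for a in ages]
sd_after = stdev(filled)

print(observed_mean, mean(filled))              # the mean is unchanged
print(round(sd_before, 3), round(sd_after, 3))  # the standard deviation decreases
```

The substituted values sit exactly on the mean, so they add nothing to the sum of squared deviations while increasing the number of cases, which is why the standard deviation falls.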
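The difference between the two deletion methods can be sketched as below. The records, variable names and helper functions are hypothetical and are used only to show how the number of usable cases changes.

```python
# Hypothetical records; None marks a missing value.
records = [
    {"age": 34, "salary": 21000, "weight": 70},
    {"age": 51, "salary": None, "weight": 82},
    {"age": None, "salary": 30000, "weight": 65},
    {"age": 45, "salary": 28000, "weight": None},
    {"age": 29, "salary": 19000, "weight": 77},
]


def listwise(rows):
    """Keep only rows with no missing value on any variable."""
    return [r for r in rows if all(v is not None for v in r.values())]


def pairwise(rows, var_a, var_b):
    """Keep rows that are complete on the two variables being analysed."""
    return [r for r in rows if r[var_a] is not None and r[var_b] is not None]


print(len(listwise(records)))                     # 2
print(len(pairwise(records, "age", "salary")))    # 3
print(len(pairwise(records, "age", "weight")))    # 3
print(len(pairwise(records, "salary", "weight"))) # 3
```

Each pairwise analysis keeps three cases, but not the same three, so results from different pairs of variables are based on different subsamples.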
Boxplot identifying outliers for age
Boxplot for age without outliers
Boxplot for salary with outliers
Boxplot for salary without outliers
Outliers are extremely high or low values that appear out of line with the other values in a dataset (Bremer, 1995). In a boxplot, these values lie well above the upper quartile or well below the lower quartile, conventionally more than 1.5 times the interquartile range beyond either quartile. They can misrepresent the data analysis, because descriptive statistics that are sensitive to outliers, such as measures of central tendency, become distorted (Meyers, Gamst and Guarino, 2013). Outliers might occur in a dataset because of an error in data input or computer entry, an error in the method of data collection or interpretation, information that is completely unrelated, or the natural variability of the data (Reimann, 2008).
The outlier in age can be recoded as a missing value, because the data point has to be eliminated from the analysis: the value 448* is far out of range compared with the other values in the dataset. Such a high value suggests that it should not be in the data and most likely arose from a data entry error. This method of elimination is known in statistics as trimming, which is used to remove outliers that arise from sampling bias, data entry error, measurement error or incorrect use of a tool, and where the source data are no longer available for checking (Isshiki, 2009).
The outlier for salary will have to be replaced through the method known as Winsorizing, because the entry might not be an error; it could be a legitimate high score that is greater than the other scores in the dataset. In this process the outlier in salary is replaced with the next highest value in the dataset. The process cannot be applied to more than three such outliers in a dataset (Balakrishnan, Barnett and Lewis, 1995).
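A boxplot-style check for outliers can be sketched as follows. It assumes the common convention of flagging values more than 1.5 times the interquartile range beyond the quartiles, which is not stated in the sources cited above. The ages are hypothetical apart from the value 448, and the quartile method used by Python's `statistics.quantiles` may differ slightly from the one SPSS uses, so the fences are indicative only.

```python
from statistics import quantiles

# Hypothetical ages; 448 stands in for an implausible data entry.
ages = [22, 25, 31, 38, 39, 41, 47, 52, 60, 448]

# Lower quartile, median and upper quartile.
q1, q2, q3 = quantiles(ages, n=4)
iqr = q3 - q1

# Fences at 1.5 x IQR beyond the quartiles (assumed convention).
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = [a for a in ages if a < lower_fence or a > upper_fence]
print(outliers)  # [448]
```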
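The two treatments described above can be contrasted in a short sketch. The function names and the salary values are hypothetical; the Winsorizing step follows the description in the text, replacing the single highest value with the next highest one.

```python
def trim(values, outlier):
    """Trimming: remove the outlying value so that it is treated as missing."""
    return [v for v in values if v != outlier]


def winsorize_top(values):
    """Winsorizing the highest value: replace it with the next highest value.

    Assumes the data contain at least two distinct values.
    """
    distinct = sorted(set(values))
    highest, next_highest = distinct[-1], distinct[-2]
    return [next_highest if v == highest else v for v in values]


ages = [25, 39, 41, 448]                          # 448 treated as an entry error
salaries = [18000, 22000, 25000, 31000, 250000]   # hypothetical salaries

print(trim(ages, 448))          # [25, 39, 41]
print(winsorize_top(salaries))  # [18000, 22000, 25000, 31000, 31000]
```

Trimming reduces the number of cases, whereas Winsorizing keeps every case and only pulls the extreme value in towards the rest of the data.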
Week 12 Portfolio Content Three
Statistics
According to Bowers (1997), descriptive analysis is used to describe the characteristics of a sample from which inferences about the population are drawn. In the above statistical analysis, the first part of the table shows that the total number of values for the participant age variable is 9281, without any missing values. The second part gives the measures of central tendency, which summarise the typical value of the observations; these include the mean (average), mode and median. The mean of Age last Birthday in the observations is 39.15. The median, the middle value of the ordered observations, is 39. There is not much difference between the median and the mean. These measures of central tendency indicate an approximately symmetric, “bell-shaped” or normal distribution; however, if the mean is greater or less than the median, the distribution is positively or negatively skewed respectively (Aschengrau and Seage, 2008).
The third part covers dispersion, which includes the standard deviation, skewness and the range, which is the difference between the maximum and minimum values in the distribution. The standard deviation indicates how far individual data points are from the mean (Ðoric et al., 2007) and how spread out the values of the variable are within a single sample, whilst the standard error estimates the spread of the means of all possible samples from the population. The range helps to identify outliers in a dataset (Reimann, 2008). The skewness value is 0.137, which is greater than 0; this suggests a slight positive skew in our sample distribution (Bulmer, 1979).
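The descriptive statistics discussed above can be reproduced in outline with the sketch below. The ages are hypothetical, and the skewness formula is the adjusted sample skewness, which is assumed here to be the form that SPSS reports.

```python
from math import sqrt
from statistics import mean, median, stdev

# Hypothetical ages (not taken from the survey dataset).
ages = [20, 30, 35, 39, 40, 45, 50, 61]

n = len(ages)
m = mean(ages)                         # central tendency: mean
med = median(ages)                     # central tendency: median
s = stdev(ages)                        # dispersion: sample standard deviation
se = s / sqrt(n)                       # standard error of the mean
value_range = max(ages) - min(ages)    # dispersion: range

# Adjusted sample skewness: positive when the right tail is longer.
skew = n / ((n - 1) * (n - 2)) * sum(((x - m) / s) ** 3 for x in ages)

print(m, med, value_range)
print(round(s, 3), round(se, 3), round(skew, 3))
```

In this example the mean (40) is slightly above the median (39.5) and the skewness is positive, the same direction of skew as described for the age variable.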
The histogram is a visual representation of a scale variable in SPSS. For the variable “Age last birthday” in the dataset HSE2002SMALL, it shows that the data are approximately normally distributed. Although values are clustered more to the right, this simply shows that as we move away from the centre, age values become less frequent (Field, 2013).
This is a descriptive pie chart showing the proportions of people in the HSE 2002 survey dataset who are trying and not trying to change weight. Pie charts are used to present the frequency distribution of the categories of a single categorical (qualitative) variable in pictorial form, and they are usually labelled in percentages (Kinnear and Gray, 2008).
The scatterplot is used to provide a visual representation of the relationship between the independent variable (age) and the dependent variable (diastolic blood pressure). The independent variable on the X-axis predicts the dependent variable on the Y-axis, and each point on the plot represents the measurements on the two variables for a specific individual in the dataset (ref).
The line of best fit indicates a positive correlation on the scatter graph, since the line runs from the lower left to the upper right; if it ran in the opposite direction, from the upper left to the lower right, it would indicate a negative correlation (Wilcox, 2012).
The histogram above shows a normal distribution of valid mean waist/hip ratio in relation to age last birthday in the dataset. This frequency description is a guide to which statistical test is appropriate: since the histogram shows a normal distribution, a Pearson correlation test can be performed; if the distribution had been non-normal, a Spearman correlation would be performed instead (REF).
Pearson Correlation
According to the guideline for interpreting correlation coefficients in Dancey (2004), the correlation coefficient of 0.335 among the 9281 participants indicates a weak relationship, although the p-value is significant as it is less than 0.05.
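The calculation behind a Pearson coefficient can be sketched as below. The paired ages and waist/hip ratios are hypothetical; only the coefficient of 0.335 in the last line comes from the SPSS output, and squaring it gives the proportion of shared variance.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation: covariation of x and y divided by their joint spread."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Hypothetical paired observations.
ages = [20, 30, 40, 50, 60]
waist_hip = [0.80, 0.84, 0.83, 0.90, 0.88]

r = pearson_r(ages, waist_hip)
print(round(r, 3))

# Shared variance for the reported coefficient of 0.335.
print(round(0.335 ** 2, 3))  # 0.112
```

The value 0.112 agrees with the 11.2% of variance reported in the regression model summary for the same two variables.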
Variables Entered/Removed (a)
Model | Variables Entered | Variables Removed | Method
1 | Age last birthday (b) | . | Enter
a. Dependent Variable: (D) Valid Mean Waist/Hip ratio
b. All requested variables entered.
This table shows the variables used in the regression model.
The model summary shows that 11.2% of the variance in valid mean waist/hip ratio is explained by age last birthday.
The intercept indicates that the valid mean waist/hip ratio is 0.785 at an age of zero, and it is predicted to increase by 0.002 for each one-unit increase in age. The p-value is significant as it is below 0.05, and the lower and upper bounds of the confidence interval for the coefficient, 0.001 and 0.002 respectively, do not include zero, so the result is statistically significant (Pallant, 2010).
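The fitted line can be written out as a small prediction sketch using the intercept (0.785) and slope (0.002) from the coefficients table. The function name and the example age of 40 are illustrative assumptions.

```python
INTERCEPT = 0.785  # predicted waist/hip ratio at age zero (from the output)
SLOPE = 0.002      # predicted change in waist/hip ratio per year of age


def predicted_waist_hip(age):
    """Predicted valid mean waist/hip ratio for a given age last birthday."""
    return INTERCEPT + SLOPE * age


print(round(predicted_waist_hip(40), 3))  # 0.865
```

Because the coefficients in the output are rounded to three decimal places, predictions made this way are approximate.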
The above histogram is the first stage before a t-test, which can be used to compare means between the groups of a categorical variable. The histogram shows a positive skew in the distribution of the dependent variable. When the dependent variable is normally distributed a t-test can be run, but where the dependent variable is skewed a bar chart is generated to compare the groups (Ho, 2013).
When the distribution is non-normal, a non-parametric test like the one below is performed to compare the distribution of the dependent variable between the groups (Woodley, 1994).
The bar chart above shows the distribution of total portions of fruit and vegetables between men and women in the HSE2012 Survey.
The Mann-Whitney test is a non-parametric test that does not assume a normal distribution. In this test the individuals in one group are measured on a dependent variable, and the individuals in the other group are measured on the same dependent variable. The test evaluates a null hypothesis that is rejected or retained depending on the p-value.
The above hypothesis test summary suggests that the null hypothesis is rejected in favour of the alternative hypothesis, because the p-value is below the alpha value of 0.05 (Howitt and Cramer, 2011).
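The statistic behind the Mann-Whitney test can be sketched by comparing every member of one group with every member of the other. The portion counts below are hypothetical, and the sketch computes only the U statistic; the p-value that SPSS reports is not derived here.

```python
def mann_whitney_u(group_a, group_b):
    """Return (U_a, U_b): over all cross-group pairs, a higher score counts 1 and a tie 0.5."""
    u_a = 0.0
    for a in group_a:
        for b in group_b:
            if a > b:
                u_a += 1
            elif a == b:
                u_a += 0.5
    u_b = len(group_a) * len(group_b) - u_a
    return u_a, u_b


# Hypothetical daily portions of fruit and vegetables.
men = [2, 3, 3, 4, 1]
women = [4, 5, 3, 6, 5]

print(mann_whitney_u(men, women))  # (2.5, 22.5)
```

A U value for one group that is far from half the number of pairs (here 12.5) suggests that scores in one group tend to be higher than in the other.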
This is a 2×2 table that illustrates the distribution of the nominal categories of men and women who wear glasses. The cross-tabulation indicates that 53% of men wear glasses and 47% do not. For the same dependent variable, 63.4% of women wear glasses and 36.6% do not (Walsh and Ollenburger, 2001).
The chi-square test is used to judge whether a pattern of association has arisen by chance or whether there is a genuine association, but it does not tell us how strong the association between the IV and DV is. In the above chi-square test the p-value is less than 0.05, so the result is significant and the null hypothesis is rejected (Ho, 2013).
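The chi-square statistic compares the observed counts in each cell with the counts expected if sex and wearing glasses were unrelated. The sketch below uses the percentages from the cross-tabulation but assumes 1000 men and 1000 women, because the actual counts are not given in the text, so the resulting value is illustrative and will not match the SPSS output.

```python
def chi_square(table):
    """Pearson chi-square statistic for a table of observed counts (no continuity correction)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    return chi2


# Rows: men, women. Columns: wears glasses, does not wear glasses.
# Counts assume 1000 people of each sex (an assumption for illustration).
observed = [[530, 470], [634, 366]]

chi2 = chi_square(observed)
print(round(chi2, 2))

# Critical value for 1 degree of freedom at alpha = 0.05.
print(chi2 > 3.841)  # True
```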
The model summary consists of two measures, the Cox & Snell and Nagelkerke R square. These measures are used to estimate the percentage of variance in the dependent variable that is explained by the model. The Cox & Snell value is always lower than the Nagelkerke value (Kinnear and Gray, 2008).
The Cox & Snell R square indicates that 2.4% of the variance in the DV is explained by the IV, while the Nagelkerke R square indicates that 3.2% of the variance in the DV is explained by the IV.
Both measures suggest that sex does not explain very much of the variation in wearing glasses.
The odds ratio of 1.845 indicates that the odds of wearing glasses are 84.5% higher in this sex category than in the reference group (male). The significant p-value gives confidence that the association is not due to chance. The lower and upper bounds of the confidence interval lie below and above the odds ratio of 1.845 respectively, and the interval does not include 1 (Kinnear and Gray, 2008).
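The step from an odds ratio to a percentage change in odds is simple arithmetic, sketched below. The function name is illustrative; the odds ratio of 1.845 is the value reported in the output.

```python
def percent_change_in_odds(odds_ratio):
    """Percentage change in the odds relative to the reference group."""
    return (odds_ratio - 1) * 100


print(round(percent_change_in_odds(1.845), 1))  # 84.5
```

An odds ratio of exactly 1 gives a change of 0%, which is why a confidence interval that includes 1 indicates no significant difference from the reference group.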
Binary logistic regression is a statistical test that allows more than one independent variable (IV) to be examined in a model to determine their impact on a binary dependent variable (DV); it explains the extent to which the IVs can change or influence the DV (Pallant, 2010).
The binary logistic regression examines the effect of various behavioural risk factors, together with age and sex, on high diastolic blood pressure.
The DV was high diastolic blood pressure, the main IV was alcohol consumption, and the other confounders were sex, age, smoking and weight.
Model 1
Both R square values from the model summary indicate that very little of the variance in the DV is explained by the IV, as their values are close to zero.
The variable in the equation, Draft – total units of Alcohol, has a p-value of 0.009, which is significant, and the odds ratio shows a 0.5% increase in odds; there is 95% confidence that the association is unlikely to be due to chance.
Model 2
In the second model, after including smoking, the Cox & Snell and Nagelkerke R square values increased from 0.001 and 0.002 to 0.003 and 0.008 respectively. The increase in odds for the main independent variable rose from 0.5% to 0.6%, which suggests there is not much difference in the relationship between the independent variable and the dependent variable (ref). The p-value is less than 0.05, and the odds of developing HBP among the smoking categories are increased by 17.2%. There is 95% confidence that this is unlikely to be due to chance (Steurer, 2002).
Model 3
Sex has been introduced as an independent variable; the Cox & Snell and Nagelkerke R square values increased by 0.002 and 0.006 respectively. The odds ratio for the main IV, alcohol consumption, decreased by 0.2% when sex was added to the model, which indicates that there is not much impact on the DV. The p-value is less than 0.05. The odds of sex being a risk for developing HBP are reduced by 3.5%, and there is 95% confidence that the relationship is unlikely to be due to chance (Kinnear and Gray, 2008).
Model 4
After adding age to the model, the Cox & Snell and Nagelkerke R square values increased to 0.021 and 0.063 respectively. The odds ratio for the effect of alcohol on the DV increased by 0.8% when age was included as an IV. As age increases by one unit, the chance of HBP increases by 0.032. The p-value was less than the alpha value of 0.05 (King, Rosopa and Minium, 2011).
Model 5
The Cox & Snell R square increased from 0.021 to 0.024 and the Nagelkerke R square from 0.063 to 0.069, so there was only a slight change in both measures. Weight was added to the other IVs; it increased the odds ratio for alcohol as a risk for HBP by 1.1%, and as weight increases by one unit the chance of developing HBP increases by 0.011. The p-value is significant because it is below 0.05, and there is 95% confidence that the relationship is unlikely to be due to chance.
Although alcohol consumption was the main independent variable, all the other confounders were also significant, since all their p-values were less than 0.05. Smoking category had the greatest impact on the DV: people in the smoking category had the highest odds of developing HBP (Pallant, 2010).